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Abstract 

O 

We perform a non-asymptotic analysis on the singular vector distribution under 
Gaussian noise. In particular, we provide sufficient conditions of a matrix for its 
{SJ first few singular vectors to have near normal distribution. Our result can be used 

to facilitate the error analysis in PCA. 

< 

£ 1 INTRODUCTION 

The singular value decomposition (SVD) lies in the heart of various dimension reduction 
(DR) techniques, such as PCA, LE, LLE etc. In real world problems, noise from unknown 
1 — 1 sourses may largely change the dimension reduction result by changing either the principal 

^vq directions, or the data distribution along those directions. Aside from some unusual cases 

that involve nonlinear bias in the measurements, and because of the central limit theorem, 
it is always assumed that the noise vector in the raw data obeys the i.i.d. Gaussian 
distribution, which simplifies both data processing and error analysis. However, when 
doing dimension reductions, the Gaussian distribution is not preserved. This is because 
the low dimensional data is contained in the first few singular vectors of some predefined 
^ kernel, and elements of any singular vector are bounded by one, which prevent them from 

being Gaussian variables. In [2], it is pointed out in a statistical and asymptotic way 
that for a fixed data matrix, as the noise level goes to 0, the distribution of singular 
vectors would eventually become very close to a multivariate normal distribution. In this 
paper, we go further in this direction and provide non-asymptotic bound on the distance 
between the perturbed singular vectors and a multi-variate normal random variable. In 
applications, our result could be used in two ways: 

• To predict for a fixed data matrix and noise level, whether or not can one reliably 
(within a given confidence interval) assume that the noise after dimension reduction 
is Gaussian. 

• For a given noise level, to determine how many samples are necessary for the noise in 
the embedded space to be close (e.g., the distance is less than some given threshold) 
to Gaussian. 
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We state the mathematical description of our problem as follows. 

Problem: Suppose Y is an n x n noiseless data matrix, and Y = Y + eW is the observed 
noisy data. Their associated SVDs are: 

Y = (U h U 2 )^ S °J(^ 2 ) T , (1) 

and 

Y = {U U U 2 )^ ^)(VuV 2 f, (2) 

where the entries of W are i.i.d. iV(0, 1) and e > is an absolute constant. Due to 
the randomness of W , all the quantities in (2) are random variables whose measures are 
induced by those ofW. Under this setting, what is the distribution of JJ\ — U\? 



In this paper, without loss of generality, we focus only on the low rank and square Ys 
(i.e., £2 = 0). All our techniques can be carried through in high rank and rectangular 
cases. 

The perturbation problem of singular vectors has brought great attention in the field 
of statistics and numerical analysis. Davis and Kahan [3j studied the deterministic pertur- 
bation on Hermitain matrices and provides an upper bound for the rotation of singular 
vector subspaces caused by the perturbation. They showed that the span of a group 
of singular vectors as a subspace is changed by an amount which is propotional to the 
noise power and the reciprocal of eigenvalue gap. (the change is charaterized in canonical 
angle between subspaces, denoted by Z(Ui,Ui)). Wedin [12] extended this result to non- 
Hermitain cases. Dopico [5] proved that the left and right singular vectors is perturbed 
towards the same direction. Van Vu [UJ considered random perturbations (i.i.d. Bernoulli 
distribution) and provided a tighter upper bound for the canonical angles Z(Ui,Ui), which 
hold with large probability. The necessity of using the canonical angles to characterize 
the change is well illustrated in the following example in [TT] . Consider the following two 
matrices, 

"-(;:). H > 

Both matrices are diagonalizable, the eigenvectors of Y are (1,0) and (0, 1), while those 
of Y are (l/v2, l/\/2) and (l/v2, — l/v2), for any e. Hence, the difference of these two 
bases does not go to zero with e. 

Beside the canonical angle, some authors used another quantity: min \\Ui — UiM\\p 

M unitary 

to describe the changes of singular subspaces. The latter has been proved to be an upper 
bound of the former. 



In this paper, we use the pointwise matrix norm || ■ || max to bound the difference 
between two matrices. Our technique is valid and simpler for the Frobenius norm case so 
we omit it. However, if one tries to use the bound we provide on pointwise norm to derive 
a bound on Frobenius norm using the equivalence of norms, he will get a looser bound 
than directly running over our technique on Frobenius norm, and vice versa. 



The rest of the paper is organized as follows. In Section 2, we introduce notations and 
some existing theorems to be used. In Section 3, we state several lemmas and our main 
theorem. Section 4 contains the proof of the main theorem. In Section 5, we apply our 
result to an audio signal classification problem. 

2 PREVIOUS RESULTS 

Throughout this paper, we consider only square data matrices with nontrivial dimension- 
ality n > 2 and rank k < n. The variables Y, Y, Sj, E i5 C/j, C/j, Vi, Vi, i = 1,2, remain 
the same as defined in (1) and (2). We assume S 2 = 0. In addition, we assume that all 
the diagonal elements of the k x k matrix Si are bounded away from zero. 

For a given matrix A, we use Qa and Ra to denote the QR decomposition of A. The 
set of singular values of A is denoted by o~(A); the kth largest one is denoted by o~k(A); and 
the minimum singular value is by o~ m i n (A). We use W to denote the normalized Gaussian 
matrix whose entries are N(0, l/n). For any matrix, || ■ \\p denotes the Frobenius norm, 
and || • || 2 the spectral norm. In addition, we use || • || max to denote the component-wise 
maximum norm of a matrix, i.e., ||A|| max = max \Ai A. 

The following theorem is due to Dopico jS], who provided the worst-case bound on 
the Frobenius norm of the deviation of singular vector subspace under perturbation. He 
proved that this bound is proportional to the Frobenius norm of the perturbation as well 
as the reciprocal of the eigenvalue gap. 

Theorem 1 ([5]). Let Y and Y and their SVDs be as defined in (1) and (2). Define 

5 = min <^ min \ji - a min (Si) + a min (^i) \ ■ 
Ue^(Si),^e<T(s 2 ) J 

// 5 > 0, then 

min JllUtM - + jvjtf - < ^HE+ Hg !!i ; ( 3 ) 

M unitary " 

where R = (Y — Y)V\, S = (Y — Y) T Ui. Moreover, the left hand side of (3) is minimized 
for M = Z\Z\ where ZiSZj is any SVD ofUjUi + V^Vi, and in this case, the equality 
can be attained. 

In the proof of our main theorem (Theorem 4), we will frequently encounter the 
spectral norm of Gaussian matrices, which is known to be bounded in the following 
theorem. 

Theorem 2 ([10]). Let W be an n x k matrix with i.i.d. normal entries with mean and 
variance 1/n. Then, its largest and smallest singular values obey: 

P(X k {W) > 1 + + < e" n72/2 , 

In order to estimate the gap of eigenvalues in (3), we will make use of the following 
result. 



Theorem 3. (fT^ ) Let A be an n x k matrix and A + E be a perturbation of A. Let Aj and 
Aj be the ith largest eigenvalue of A and A + E, respectively. Then, fori = 1, ... min{n, k}, 



|Aj — Ai| < \\E\\ 2 . 

3 MAIN THEOREM 

We now ready to state our main theorem. 

Theorem 4. Let Y and Y and their SVDs be defined as in (1) and (2). Assume e, - 
are small enough such that the following defined quantities satisfies E 2 > l/(2\/k) and 
5i > (see below). Then with probability (with respect to the random Gaussian noise W) 
exceeding 1 — 3 e -( n - fc )' 9 + ln ( fc ( n + fc )) _4 e -™7 2 / 2 ? ^ ere exists an unitary kx k matrix M, such 
that for < (3 < 1/2, we have 



\\(U 1 -U 1 M)-eN)\\ msx < K ^ E ' +v / 2l(||^ 1 || ma x + ^4)(v / 2 + l) ^ + E 3 , (4) 

l-VkE 2 

where N is a Gaussian matrix defined by N = U^^WViT,^ 1 , and where 
Ei(e, E, k, n, 7) = e 3 ||S _1 |||(ct^tt 3 + 2a\a 2 + 2aia 2 a 3 ) + e 4 ||S _1 1| | (ctf ct:| + 2a\a 2 a 3 ) 

E 2 (e,E,k,n,j) = e 2 ||E^ \\ 2 2 a\ + 2e 3 ||E^ \\\a\a 2 + e 4 ||S^ \\ 2 a\al, 
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E 3 (e,Z,k,n^) = e\l + k){l + -^){n-k)-^\\i:- l 1 \\l 

E A (e, E, k,n, 7) = eaiHE^ 1 ^ + e 2 ||E 1 " 1 ||2aia 3 , 
Si = cr mi „(Si) - (2 + 7)e. 

are defined later in formulas (19), (20), (25), (29), (17), and where aci = 1 + 7 + 
a 2 = 2 + 27 + 2^/k/n, and a 3 = 2 + 7. 

Note that the order of Ei, ...,£'4 in terms of e and n are: E\ = 0(e 3 ), E 2 = 0(e 2 ), 
E 3 = 0(e 2 n~^), E A = 0(e) and 5 = 0(1). 

Corollary 1. If the eigenvalues ofY are all different, then (4) becomes 



\\(Ui - U x ) - eiV)|| max < X ^ E ' + v^dlt/xlUax + g 4 )(V2 + l) ^ + £? 3 . 

where 5 2 = min {min {^(Ei) - cr 3 -(Ei)|,i, j = 1, k,i ^ j} , cr min (Ei)} - 2(2 + j)e. 

Remark 1: We observe that the perturbation TJ\ — U% can be approximated by a 
Gaussian term eN only when it is the leading term in the error. The average magnitude 
of this term has asymptotic order of 0(e/yjn) as e and - going to 0, while the order of 
the leading terms on the right hand side is either e 2 ||L r i|| max or e 2 n~^ +l3 . Therefore, to 



ensure the Gaussian term is higher in order, we need the following condition on the (e, n) 
pair: 

e = o [ min \ n' 13 , — — r l). (5) 

V [ H^lUmax^ 2 J / 

Before going into the proof, we first establish three useful lemmas. The first one is 
an elementary observation in linear algebra, and the last one is a direct application of 
Theorem 2, so we omit their proofs. 

Lemma 1. If A = f^j then \\A\\ 2 < \/2max jVll^iill! + P21II2WP12H2 + P22II2} 

Lemma 2. Let T be the distribution of the product of two independent N(0, 1) normal 
random variables. Let X\, X-i,..., X n be i.i.d. random variables drawn from J 7 . If n > 2, 
we have 

P(X > n-^) < 1e~ n \ for any (3 G R, 

where X = Xij fn. 

Proof. If Xi ~ J 7 , one can verify that for all < 9 < 1, 



Ee 0x * ' 



i-e 2 ' 

We apply Markov's inequality (see eg. |SJ) to have: 

1 — F,p tx 

PCX > n~^) = P(e tx > e tn " 2 ) < -=^—, for any t G 

Letting t = n 1 ^ 2 in the above formula, we get 

P(X > n-^) < = e-^HEin-WXi) 



e" 

i=i 



n 2 
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Lemma 3. Let T be an n x n Gaussian matrix whose elements are i.i.d. N(0, -). Let T 
be written as 

rp _ ( T\\ T\2 

with Tu a k x k matrix. Then, with probability exceeding 1 — Ae~ <yn ~ k ^ ''I 2 , 

.. . fk [k 

\\T11 2<7 + 2\/-, \\T21 2 , \\T12 2 < 1 + 7+ \ ~i \\ T 2, P22 2<2 + 7. 
\ n V n 



4 PROOF OF THE MAIN THEOREM 



Proof. The idea of the proof is the following: Our goal is to charaterize the perturbed 
singular vector space Ui, which is a left orthogonal matrix and satisfies UjYV\ = Ei, 
meaning that it, together with V\, diagonalizes Y . Thus, if we can construct two other 
orthogonal matrices which also diagonalize Y with small error, then, by Theorem 1, they 
are expected to be good approximations to U\ and V\. 



We start the proof similarly to that in j2]. Define the random matrix C as follows, 

U T YV = E + eU T WV = E + eC, (6) 

where E - °\ and C - ( ° u ° 12 ) - ( U ? WVl U ^ WVi \ Due to the 

where L - ^ Q Q j and G - ^ ^ ^ j - ^ X fg W y x U%WV 2 ) ' Due t0 the 

invariant property of Gaussian matrix, C has the same distribution as W. The asymtotic 
order of e in the error term eC is not satisfactory, we thus want to further diagonalize C 
by using the following two matrices, 



G 21 Er 1 ) +e { -B 



with B = -C 22 Cf 2 ^ 2 + CaiEr^uSr 1 , and 

° = I + e \C^ J +£ {-IF 

with D = —Yj^ 1 C 2 \C22 + E^ 1 GnE|f 1 Gi2. Note that in doing so, we start to differ from 
[2] by defining P and O explicitly. The second order terms in P and O are crucial for 
obtaining (4). 

Keeping in mind that P and O are not yet unitary, we multiply the left hand side of (6) 
by P T on the left, and by O on the right, to obtain, 

P^YVO = (/ + e ( ^ , -% C - ) + «» ( ° B * 



x E 



C\\ C\2 \\ ( T ( — E x 1 Gi2 










M 


-D T 





C*21 C*22 J J \ \ ^12^1 

Ei + eGn + e 2 (E 1 1 C^ L G2i + Ci2C] Z 2 E 1 : ) « , 

" e G22-2 e 2 G 21 S r 1 G 1 2 )+jfc/rr ' (7) 



where the matrix .Err includes all the terms whose order on e is greater than or equal to 3, 
so Err is 0(e 3 ). Compare (7) with (6), we see the above operation has changed the order 
of error term from 0(e) to 0(e 3 ). If we denote the eigenvector subspaces corresponding 
to the two diagonal blocks by UP = ((UP) h (UP) 2 ), VO = ((V0) u (V0) 2 ), then they 
have the following expressions, 

(UP)! = U 1 + eU 2 C 2 i^ 1 -e 2 U 2 B, (8) 
(VO)! = V 1 + eVid&Ei 1 - e 2 V 2 D T . (9) 



Recall that Qa and Ra denote the QR decomposition of a matrix A. Now, we want to 
orthogonalize UP and VO. For that purpose, for both sides of (7), we multiply them by 

< -^ p - )l T ] on the left, and ( ^X, ^ 11 D -i J on the right. This way we obtain: 

U K (UP) 2 J V u K (VO)J 

ii^fr i)-e <io) 

where 

L\ = R(jjp) 1 (^i + e Cn + e2 (^! 1 C 2 T 1 C 21 + Ci 2 C^ 2 S 1 1 ))i? ( ,^ ^, 
£2 = R {up) -t(eC 22 - e 2 C 21 T, 1 1 C 12 )R^y ^ 2 , 

With a little abuse of notation, we continue to use Err to denote the error term in (10), 
though it was already changed by the left and right multiplication. We denote by MjEjM- 
the SVD of the L i: with i — 1,2. In this case, (10) becomes 

On the other hand, the left hand side of (11) also satisfies 

= (Q(up) 1 Q(up) 2 ) 7 Y (Q(vo)! Q(vo) 2 ) ■ (12) 
Combining (11) with (12), we obtain: 

(Q(t/p)! <5(t/p) 2 ) T ^(<5(yo) 1 Q(vo) 2 ) = (^ Ml ^ lMl M2 ^ 2M /)+ Err - 

Moving everything but Y on the left hand side to the right by multiplying the inverse of 
each matrix and utilizing the fact that (UP) 1 and (UP) 2 are orthogonal to each other, 
we derive: 

Y=(Q (UP)l M 1 Q {UP)2 M 2 ) ^(Q (V o h M[ Q {vo)2 M' 2 f + Err. (13) 

(13) combined with Theorem 1 imply that U\ and Q(up) 1 Mi are the first k left singular 
vectors of two very similar matrices (different by the Err term) and that they are close. 
Keeping this useful result in mind, we first turn to look at the big picture. 



Our final goal is to approximate by a Gaussian variable the difference between XJ\ and 
U-i up to a roation: Ui — U X M (we willl define the unitary matrix M explicitly later), 
which can be decomposed as: 

Ui - UiM = (p! - Q {UP)l M) + (Q (UP)l M - (UP^M) + {{UP) X M - U X M). (14) 

We insert (8) into (14) to get 

Ui ~ U,M = (U, - Q (UP)l M^ + (Q {UP)l M - (UP^M) + (e^C^T 1 - e 2 U 2 B)M. 

Here, the Gaussian term is the second to last term on the right, so we want to prove all 
other terms are small. For that purpose, we move the Gaussian term to the left and take 
the component-wise matrix norm on both sides, to have: 

\\U 1 -U 1 M-eU 2 C 2 {Z7 1 M\\ max 
< \\Ui- QiUP^MlU, + e 2 ||f/ 2J BM|| max + \\Q (UP)l M - (fLP)iM|| max (15) 
= I + 11 + III. 

Observe that the left hand side of (15) is exactly what we want to bound in this theorem. 
The rest of the proof is divided into three parts to bound each term J, /J, III on the 
right hand side. 



4.1 ESTIMATING I 

From (2) and (13), we obtain that (Ui, U 2 ) and (<5([/p) 1 Mi, Q( UP ) 2 M 2 ) are the left eigenspaces 
of Y and Y — Err, respectively. Thus, we want to use Theorem 1 to bound /. For this 
purpose, we fisrt calculate the key parameter S in that theorem. 

Recall Y = Y + eW. By Theorem 2 and Theorem 3, the ith largest sigular value of 
Y and that of Y obey, with probability over 1 — e~™ 7 / 2 , that 

\Xi- Ai| < e||W|| 2 < (2 + 7 )e, for i = 1, n. (16) 

Equation (16) implies the following lower bound on 5: 

S>a min (E 1 )-(2 + - f )e = S 1 . (17) 

Whenever 5\ > 0, we can apply Theorem 1 to the two SVDs in (2) and (13), to obtain 

J\\Err T U 1 \\l + II ErrV x \\ \ 
min WQfjjP^L-U^pK ± . (18) 

L unitary 

The matrix Err, defined in (7) and modified in (10) and (13), is essentially a sum of 
several products of Gaussian matrices. Applying Lemma 1 on Err and utilizing Lemma 
3, we obtain the following bound: 

ii j-i I, ^ v^-Bi(e,Si,A;,n,7) 
\\Err 2 < 



1 -E 2 (e, Ei,fc,n,7)' 



holds with probability over 1 — 4e ( n k ^ 2 / 2 whenever e is small enough such that E 2 < 1. 
Here, 

Ei(e, S, fc, n, 7) = e 3 ||Sj" 1 ||2(o; 1 Q;3 + 2a^a 2 + 2a 1 a 2 a 3 ) + e 4 ^^ 1 ^(a^a^ + 2a\a 2 a^) 
+ e^Sri^^as, (19) 



and 



771 ^2||v-l||2^2 1 o^3 II v-l||3^2^ 1 411^-1114^2^2 

hi 2 — e l^ctj + ze ||2j 1 l^c^a^ + e ||^1 ||2 a l a 2- 

2 + 27 + and a 3 = 2 + 7. 



2\ r kE l 



(20) 



where «i = 1 + 7 + y -, a 2 
Now, the right hand side of (18) has the following bound: 



WErr^fp + ll^rrFJI < ^2k\\Err\\ 2 < 



(21) 



1-^2 

We are now ready to define the rotation M which first appears in (14). Comparing (18) 
with the first term of (14), we have 



M = MiL, 

where the explicit form of L is given in Theorem 1. 
We plug (21) into (18), to obtain: 



(22) 



\U l - Q {UP)l M\\™x < \\Ui ~ Q{u P)i M\\f < ^ft E E \) ■ 



4.2 ESTIMATING II 

We start with breaking 77 into two parts: 
II = e 2 \\U 2 BM 



max 



< e\\,U 2 C 22 C[ 2 Yr x 
= e 2 (IV + V). 

We estimate IV and V separately. 



\U 2 C 2 i'E 1 1 CnS 1 1 1 



IV = 



(Ui, U 2 ) ^12^1 2 — UiC 2 T l Cj 2 Y, l 



^ II ^1 II2 



1 1 1 2 



(U U U 2 ) 
(U U U 2 ) 



u 21 \ qT 



22 



12 



+ \\u 1 czcr 



21 w 12llmax 



22 



12 



+ fc||^l||max||C 2 T 1 ^ r 



12 Umax 



(23) 



°21 



Observe that the entries of {U\, U 2 ) \ @ J are ^(0, 1) and are independent of those 

[C T \ 

in C\ 2 . Therefore we can apply Lemma 2 to each entry (Ui, U 2 ) \ ^ 21 J Cj 2 and those of 
C^C^ to get, with probability at least 1 - 2e^ n ~ fc )' 9+ln « n+fc )), with < (3 < |, 



IV < (1 + k)(n-k)-^ W^Wl 



For V, we first observe the following random upper bound, 

V < k\\T, 1 1 1| 2 1| U2C21 1| max || Cll|| max- (24) 

Using (24), the union bound, as well as the following inequality, 



00 2 
2 f ,2 , e~ x 
e~ l dt < 



X 

we can estimate the probability that V exceeds the value ^/2kn~ 1+ ^\\Yl^ |||, 

P(V > 2kn- l+p \\^ l \\l) 

< P(k\\E7%\\ C/2C 2 i|| max ||C u || max > 2kn- 1+ P\\^ 1 \\ 2 2 ) 

— -P(||£^2C2l||max||Cll||max > 2n 1+/? ) 

< fc(n + Jfe)P(Cii(l,l) > \/2n^) 

00 

= k(n + k)y — J e~^dt 

<■ -n? +\nk(n+k) 

where Cn(l,l) is the first element of C\\. Therefore, with probability exceeding 1 



V < 2kn~ 1+fS \\E7 1 



— 1 1 1 2 
l2> 



We combine the estimates of IV and V to get, with probability greater than 1— 3e 1 ^ n k ) l3 + ln ( k ( n + k )) ; 

// < e 2 (l + k)(l + -^)(n - k)-^\\Z?\\l = E 3 . (25) 

4.3 ESTIMATING III 

The following calculation shows how far away is UP from unitary. 

P T U T UP = P T P = I + e 2 ^r^OuSr 1 ^J^) + 0(e 3 ). 

Hence, 

(UPfiiUP)! = I + ^(EiC&QttEr 1 ) + 0(e 3 ). 

Following a procedure similar to the one we used to obtain the bound in (21), we derive 
the distance between this covariance matrix and the identity matrix in Frobenius norm: 

\\(UP)l(UP)x -I\\ F < VkE 2 (e,?:,k,n,>y). (26) 

Here, E\ was defined in (19). When e is small enough such that E 2 < Theorem 2.2 
in pQ] shows that the distance ||Q(t/p)i — (UP)i\\2 can be bounded by a function of E 2 : 

VkE 2 
l-VkE 2 



||g ([/ p )l -(t/p) 1 || 2 <(V2 + i 



Thus, 



HI = \\Q{u P)l M - (UP) 1 M\\ max 

< ^\\{UP) 1 \\ m ^\\(R^ p)i -i)M\\ 2 

= v^||(C/P)i|| max || J R^|| 2 ||/- J R( [ /p )l || 2 

< v / ^||(C/^)i|| max ( C r min ((C/ j p) 1 ))- 1 ||Q( C /^) 1 - (UP)ih. (27) 

To estimate (27), we first note that from (25) and the assumption that E 2 < we 
obtain \a min ((U P)\) — 1| < \. Therefore, 

^JJupV) < ^ (28) 

We insert (26) and (28) into (27), to arrive at the bound: 

/// < V2k(V2 + 1) — %- 1| (UP)! \\ max . 

Furthermore, from (8) and Lemma 3, it is straightforward to estimate: 

||(^)i||max < H^illmax + ell^CaiSr'Ib + e^l^Slla 

< ll^illmax + eaillE^lb + e^lE^Hlaiaa^ ||C/i|| max + E 4 . (29) 

We combine the above two inequalities to get: 

\/kEo 

III< v^(||(CM| max + £4)(V5+l) /r • 

1 - V kE 2 

We now aggregate the estimates of /, // and III to get (4) and (5). □ 



5 Application 

We use the M-PSK (Phase Shift Keying) modulation classification problem as an example 
to show how our result is used to make a sampling strategy and a new classification 
method. 

In the so-called adaptive modulation system, the modulation type is varying with 
time. When the condition of channels (such as fading or interference) changes, the trans- 
mitter seeks to find the modulation type which best adapts to the new environment. 
Meanwhile, if the receiver is not informed of these changes, a modulation classification 
procedure needs to be carried out as soon as the signal is received. Here, we consider the 
classification problem for the widely used MPSK type of modulations which conveys the 
data by changing the phase of a carrier wave. Here M stands for the number of available 
phases and usually takes value 2, 4, 8, 16 or 32. An MPSK signal s(t) has the following 
mathematical representation: 



s(t) = ^p T {t - nT) cos(27r/ c t + n + C ) + w(t), 
k 



(30) 



where T is called the symbol period/ duration; p? is a function supported on [0, T] and 
is called baseband waveform; the frequency f c is called carrier frequency, and 9 C is the 
carrier phase. All these parameters, whether assumed known or not, are fixed for the 
duration of the signal. w(t) denotes an additive white Gaussian noise with two sided 
power spectral density No/2. The information is encoded in the phase parameter 9 n e 
9 M = = 0, 1,...,M-1}. M denotes the possible choice of phase shifts. If M = 2 m , 

then each symbol period can encode m binary bits. The most popular cases are M = 2 
and M — 4, where the modulation is called 2-PSK and 4-PSK or more often BPSK and 
QPSK, respectively. The classification among M-PSK modulations is defined to be the 
process of detecting the right value of M from a received noisy signal s(t). 

Among the literature about MPSK classifications (see e.g. [6], [7J, [9]), there exist 
many different types of model settings, ranging from the fully blind classifications, i.e., 
where all the parameters in (30) are unknown, to the constrained classfication problems 
where M is the only unknown parameter. For illustration purposes, we consider a simple, 
yet nontrivial, partially blind model assuming that T and Nq are known, pt = X\o,t], 
f c = k/T for some unknown integer k, 9 C is unknown, and 9 n is chosen unifromly from 0. 

Suppose we digitize the data in the following way. We take LN uniform samples from 
iV periods of s(t) and store them in an L x N matrix Y, where Y(l,n) denote the Zth 
sample of the nth period, which has the expression: 



where W is a Guassian noise matrix with i.i.d. entries. 

When noise is absent (i.e., W = 0), each column of Y is a function of 9 n . Since 9 n has 
at most M choices, the columns of Y also have only M patterns. If we deem each column 
as an L dimensional data point, then it merely is a high dimensional embedding of a 
zero dimensional parameter space. In other words, the L dimensional graph of Y consists 
only of M points. When noise is added, these M points becomes M clusters. Hence, 
the classification problem of finding the correct M is reduced to a clustering problem of 
finding the total number of clusters. 

The complexity of all well known clustering methods, such as k-means and Mean shift 
grows exponentially with dimension. Therefore, for large data sets, a preprocessing step 
of dimension reduction is necessary. The dimension reduction procedure has another two 
advantages over other well known methods (e.g., [6], [7], [9]) : 

• no carrier removal procedure is needed, relaxing the digitization rate to below the 
Nyquist rate of the carrier frequency. 

• classification and detection are completed simultaneously. 

For our problem setting, PCA is the most suitable technique for the following reasons: 




(31) 





Figure 1: Constellation of BPSK, QPSK, 8PSK and 16PSK modulations 



• The signal has Gaussian noise and PCA is just a linear-Gaussian latent variable 
model. 

• This is a linear model when we deem (cos(#), sin(#)) as parameters of Y (we will 
provide more discussion on this observation later). In telecommunications, the 2- 
D graph of all evaluations of (cos(6 l ), sin(0)) with 9 G O is formally called the 
constellation diagram of the MPSK modulation (see Figure 1), and can be used as 
an identifier of the modulation type. 

Therefore, the idea of our method is using PCA to map the high dimensional data to the 
two-dimensional constellation diagram on which clustering algorithm is applied to find 
the true cluster quantity. 

Now we provide more detail to legistate the linear model observation. 

By definition, the matrix Y has the following decomposition: 

y = ui:v T + w } 

where for I = 1, .., L, j = 1, 2, and n = 1, .., N, 



s 



It is immediate to verify that when L \ f c T, we have U T U = I2, and V is nearly orthogonal 
since we have assumed that 9 n are choosen unifromly at random from the parameter space 
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Figure 2: Dimension reduction result for BPSK, QPSK, 8PSK, 16PSK modulations, 
SNR=14, numbers of samples=4600 



(rigorous calculation of the distance between V and an orthogonal matrix can be found 
in [8]). This decomposition clearly shows how the noiseless part of Y linearly depends 
on (cos(0), sin(0)). When noise is absent, PCA does the job of seperating U from T,V T , 
where the latter is just a scaled version of the constellation. With noise is added, as 
explained earlier, each individual point in the graph turns into a cluster (Figure 2). Even 
though it is quite obvious to a human observer how many clusters there are in Figure 2 
without any prior knowledge, the exisiting clustering algorithms are surprisingly inferior 
by requiring either the number of clusters or the cluster radius as input. When the number 
of clusters is unknown as in our setting, many previous work suggest to do brute force on 
all the possible numbers (which are 2, 4, 8, 16, 32, maybe also 64, 128), and compare the 
clustering results in some ad hoc way to decide which is more likely. A more pleasant way 
to do this is by finding the cluster radius, which is exactly the place to use our results in 
this paper. 

Before applying our results, we first normalizate the singular values of Y by setting 
Y = 2F/ v / ZiV, so that the singular values of Y do not change with the matrix size, 

Y = UV T + -^L=W = UV T + -= ■ -=W. (34) 

y/LN VI y/N 



In hardware implementations, a larger value of L is usually more difficult to realize than 
that of N, because L corresponds to the sampling rate and N is the sample duration. 
Hence, we assume L < N. A generalized version of Theorem 4 for rectangular matrix 
can be similarly built, by padding zeros to form an N x N matrix. Since now the factor 
in (34) is a normalized Gaussian matrix, the other factor, denotes the energy 
of noise, corresponding to e in Theorem 4. As L — > oo, e — > 0. Theorem 4 implies that for 
a given matrix, and a given confidence level, there exists a threshold a such that, as long 
as |; = e < a, we can assume the first two eigenvectors of Y have multivariate normal 
distribution. In general, a is a function of both L and N, which makes the inquality 
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Figure 3: Clustering result: Small circles denote the low dimension representation of the 
data points returned by PCA. The eight big circles denote the theoretically predicted 
cluster radius. 



■| < a(L,N) difficult to solve. Fortunately, in our model, where \/iVj| V|| mai x is uniformly 
bounded for all N, it can be verified from (5) that a is universal for all sizes of Y. Thus 
solving the simple inequality < a gives us the feasible region of L > 4 /a 2 . 

With any feasible L, we can apply the Mean shift clustering method with Gaussian ker- 
nel. By some elementary calculations, one can derive that the 95% percentile of the Gaus- 
sian noise on each data point in the embedded space equals 2.45^2^(1 - 2/N)/(LN).. 
We set this number to be the radius. 

Since the rank of Y for BPSK signals is one, then trying to invert the second singular 
value in Theorem 4 makes the problem near singular and cause large error in the second 
dimension as shown in Figure 2. Fortunately, this case is easy to be recogonized by simply 
examing whether the second singular value of Y is much smaller than the first one. 

In our first experiment, we generate a QPSK signal with carrier frequence of 1GHZ, 
symbol rate 10MHZ, and damped by AWGN with SNR = 10. We use the sampling rate 
21 samples per symbol (much lower than the Nyquist rate of the carrier frequence, and 
satisfies L \ f c T) and sample 200 symbols. In Figure 3, the result of PCA is plotted, 
together with a circle whose radius is the 95% percentile predicted by the theorem, and 
whose centers are those found by the meanshift algorithm. We can see that the prediction 
of radius is quite accurate. 

In our second experiment, we let the SNR decrease and examine the performance of 
the above algorithm. An classification is deemed as successful only when the number of 
clusters returned by the Mean shift algorithm M strictly equal to the true type M. The 
result is plotted in Figure 4. As expected, when noise grows, the eigenvector distribution 
deviates from the Gaussian distribution and the predicted radius becomes too small for 
the algorithm to find the correct M. 

6 CONCLUSION 



In this paper, we provided a condition under which the perturbation of the principal 
eigenvectors of a matrix under Gaussian noise has a near-Gaussian distribution. The 
condition is nonasymptotic and is useful in application. We provided an simple example 




Figure 4: The success rate for BPSK, QPSK, and 8PSK modulations with respect the 
SNR 



of audio signal classification problem to illustrate how our theorem can be used to make 
sampling strategy and to form new classification technique. More details about this new 
classification scheme is discussed in j3]. 
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